AITopics | sde approximation

It is generally recognized that finite learning rate (LR), in contrast to infinitesimal LR, is important for good generalization in real-life deep nets. Most attempted explanations propose approximating finite-LR SGD with Itô Stochastic Differential Equations (SDEs), but formal justification for this approximation (e.g., Li et al., 2019) only applies to SGD with tiny LR. Experimental verification of the approximation appears computationally infeasible. The current paper clarifies the picture with the following contributions: (a) An efficient simulation algorithm SVAG that provably converges to the conventionally used Itô SDE approximation.

approximation, modeling sgd, stochastic differential equation, (7 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.41)

Add feedback

On the SDEs and Scaling Rules for Adaptive Gradient Algorithms

Neural Information Processing SystemsDec-24-2025, 00:12:57 GMT

Approximating Stochastic Gradient Descent (SGD) as a Stochastic Differential Equation (SDE) has allowed researchers to enjoy the benefits of studying a continuous optimization trajectory while carefully preserving the stochasticity of SGD. Analogous study of adaptive gradient methods, such as RMSprop and Adam, has been challenging because there were no rigorously proven SDE approximations for these methods. This paper derives the SDE approximations for RMSprop and Adam, giving theoretical guarantees of their correctness as well as experimental validation of their applicability to common large-scaling vision and language settings. A key practical result is the derivation of a square root scaling rule to adjust the optimization hyperparameters of RMSprop and Adam when changing batch size, and its empirical validation in deep learning settings.

adaptive gradient algorithm, name change, sde and scaling rule, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.62)

Add feedback

main

Sadhika Malladi

Neural Information Processing SystemsAug-14-2025, 23:32:32 GMT

accuracy, approximation, noise, (17 more...)

Neural Information Processing Systems

Genre: Workflow (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

main

Sadhika Malladi

Neural Information Processing SystemsAug-14-2025, 23:32:28 GMT

LR, is important for good generalization in real-life deep nets.

approximation, sde approximation, sgd, (14 more...)

Neural Information Processing Systems

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
North America > Canada > British Columbia > Vancouver (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

32ac710102f0620d0f28d5d05a44fe08-Paper-Conference.pdf

Neural Information Processing SystemsAug-14-2025, 04:42:16 GMT

approximation, definition 2, international conference, (12 more...)

Neural Information Processing Systems

Country: North America > Canada > Ontario > Toronto (0.04)

Genre: Research Report (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

On the Validity of Modeling SGD with Stochastic Differential Equations (SDEs)

Neural Information Processing SystemsOct-10-2024, 22:53:11 GMT

It is generally recognized that finite learning rate (LR), in contrast to infinitesimal LR, is important for good generalization in real-life deep nets. Most attempted explanations propose approximating finite-LR SGD with Itô Stochastic Differential Equations (SDEs), but formal justification for this approximation (e.g., Li et al., 2019) only applies to SGD with tiny LR. Experimental verification of the approximation appears computationally infeasible. The current paper clarifies the picture with the following contributions: (a) An efficient simulation algorithm SVAG that provably converges to the conventionally used Itô SDE approximation. Experiments using this simulation to demonstrate that the previously proposed SDE approximation can meaningfully capture the training and generalization properties of common deep nets.

approximation, sde approximation, stochastic differential equation, (3 more...)

Neural Information Processing Systems

Technology:

Information Technology > Mathematics of Computing (0.66)
Information Technology > Artificial Intelligence > Machine Learning (0.46)

Add feedback

On the SDEs and Scaling Rules for Adaptive Gradient Algorithms

Neural Information Processing SystemsOct-10-2024, 14:10:05 GMT

Approximating Stochastic Gradient Descent (SGD) as a Stochastic Differential Equation (SDE) has allowed researchers to enjoy the benefits of studying a continuous optimization trajectory while carefully preserving the stochasticity of SGD. Analogous study of adaptive gradient methods, such as RMSprop and Adam, has been challenging because there were no rigorously proven SDE approximations for these methods. This paper derives the SDE approximations for RMSprop and Adam, giving theoretical guarantees of their correctness as well as experimental validation of their applicability to common large-scaling vision and language settings. A key practical result is the derivation of a square root scaling rule to adjust the optimization hyperparameters of RMSprop and Adam when changing batch size, and its empirical validation in deep learning settings.

adaptive gradient algorithm, rmsprop and adam, sde and scaling rule, (2 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.67)

Add feedback

On the SDEs and Scaling Rules for Adaptive Gradient Algorithms

Malladi, Sadhika, Lyu, Kaifeng, Panigrahi, Abhishek, Arora, Sanjeev

arXiv.org Artificial IntelligenceFeb-13-2023

Approximating Stochastic Gradient Descent (SGD) as a Stochastic Differential Equation (SDE) has allowed researchers to enjoy the benefits of studying a continuous optimization trajectory while carefully preserving the stochasticity of SGD. Analogous study of adaptive gradient methods, such as RMSprop and Adam, has been challenging because there were no rigorously proven SDE approximations for these methods. This paper derives the SDE approximations for RMSprop and Adam, giving theoretical guarantees of their correctness as well as experimental validation of their applicability to common large-scaling vision and language settings. A key practical result is the derivation of a $\textit{square root scaling rule}$ to adjust the optimization hyperparameters of RMSprop and Adam when changing batch size, and its empirical validation in deep learning settings.

artificial intelligence, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2205.10287

Country: North America > Canada > Ontario > Toronto (0.04)

Genre: Research Report > New Finding (0.45)

Technology: